
[SPARK-21472][SQL] Introduce ArrowColumnVector as a reader for Arrow vectors. #18680

Closed
ueshin wants to merge 8 commits into apache:master from ueshin:issues/SPARK-21472

Conversation

Member

@ueshin ueshin commented Jul 19, 2017

What changes were proposed in this pull request?

Introducing ArrowColumnVector as a reader for Arrow vectors.
It extends ColumnVector, so we will be able to use it with ColumnarBatch and its functionalities.
Currently it supports primitive types and StringType, ArrayType and StructType.

How was this patch tested?

Added tests for ArrowColumnVector and existing tests.

Member Author

ueshin commented Jul 19, 2017

cc @BryanCutler @kiszk @cloud-fan

```java
@Override
public boolean[] getBooleans(int rowId, int count) {
  assert(dictionary == null);
  NullableBitVector.Accessor accessor = boolData.getAccessor();
```
Member

@kiszk kiszk Jul 19, 2017


Can we use nulls? Ditto for other places.

Member Author

I'm afraid not, because the type of nulls is ValueVector.Accessor, which has only simple methods such as isNull().
The concrete accessor APIs are different for each type.
Or should we cast nulls to the concrete type each time?

Member

I see. Can we keep NullableBitVector.Accessor instead of NullableBitVector, while keeping the same reference in two instance variables? I am concerned about the cost of the runtime cast in the getBoolean() method rather than in getBooleans().
This way I expect the get() method to be inlined by the JIT compiler, since each Accessor class is final.
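The pattern described here (holding the concrete accessor in an instance field so the hot per-row read path needs no runtime cast) can be sketched without Arrow as follows; BitAccessor below is a stand-in for Arrow's NullableBitVector.Accessor, not the real class, and all names are illustrative:

```java
// Hedged sketch: cache the concrete (final) accessor in a field so the
// per-row getBoolean() path needs no runtime cast, and the final get()
// call stays monomorphic and inlinable by the JIT.
public class AccessorSketch {
    static final class BitAccessor { // stand-in for NullableBitVector.Accessor
        private final int[] bits;
        BitAccessor(int[] bits) { this.bits = bits; }
        int get(int rowId) { return bits[rowId]; } // final class: inlinable
    }

    private final BitAccessor accessor; // concrete type, no cast per call

    AccessorSketch(int[] bits) { this.accessor = new BitAccessor(bits); }

    public boolean getBoolean(int rowId) {
        return accessor.get(rowId) == 1;
    }

    public static void main(String[] args) {
        AccessorSketch v = new AccessorSketch(new int[]{1, 0, 1});
        System.out.println(v.getBoolean(0)); // true
    }
}
```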


```java
@Override
public boolean getBoolean(int rowId) {
  return boolData.getAccessor().get(rowId) == 1;
```
Member

Can we use nulls? Ditto for other places.

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79752 has finished for PR 18680 at commit 73899b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ArrowColumnVector extends ColumnVector

```java
 */
public abstract class ReadOnlyColumnVector extends ColumnVector {

  protected ReadOnlyColumnVector(int capacity, MemoryMode memMode) {
```
Member

@kiszk kiszk Jul 19, 2017


Is there any reason not to accept dataType as one of the arguments? Having the argument would be more flexible for future usages.

Member Author

I see, I'll modify it to accept dataType, but I guess we shouldn't pass it to ColumnVector, to avoid illegally allocating child columns.

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79763 has finished for PR 18680 at commit ddfcf36.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment

Thanks @ueshin for this. I made a first pass; I see a lot of things are scoped to public - is this intended to be a public API?

```scala
  case _ => throw new UnsupportedOperationException(s"Unsupported data type: $dt")
}

def toArrowField(name: String, dt: DataType, nullable: Boolean): Field = {
```
Member


Is this only used for testing?

Member Author

No, this is used to create an Arrow schema from StructType in ArrowUtils.toArrowSchema(), too.


```scala
import org.apache.spark.sql.types._

object ArrowUtils {
```
Member


Shouldn't this be private[sql]? Also in other places.

```java
/**
 * A column backed by Apache Arrow.
 */
public final class ArrowColumnVector extends ReadOnlyColumnVector {
```
Member


Is this planned to be a public API right now?

```java
  }
  resultStruct = new ColumnarBatch.Row(childColumns);
} else {
  throw new UnsupportedOperationException();
```
Member


Can this whole "if else" block be put into a pattern match instead?

Member Author

Unfortunately, this class is written in Java, so we can't use a pattern match.

```java
/**
 * An abstract class for read-only column vector.
 */
public abstract class ReadOnlyColumnVector extends ColumnVector {
```
Member


Wouldn't it be better to refactor ColumnVector into classes that separate reading/writing so you could just extend the read portion instead of making this class that throws exceptions on writes? e.g.

ColumnVector -> ColumnVectorWritable -> ColumnVectorReadable
ArrowColumnVector -> ColumnVectorReadable
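A minimal sketch of the split being proposed, written with interfaces rather than the class chain above; all names are illustrative, not Spark's actual API. The point is that an Arrow-backed vector implements only the read surface, so writes become impossible by construction instead of throwing UnsupportedOperationException at runtime:

```java
// Hedged sketch of a read/write split for column vectors.
public class HierarchySketch {
    interface ReadableColumnVector {
        boolean getBoolean(int rowId);
    }

    interface WritableColumnVector extends ReadableColumnVector {
        void putBoolean(int rowId, boolean value);
    }

    // An Arrow-backed vector implements only the readable side.
    static final class ArrowBackedVector implements ReadableColumnVector {
        private final boolean[] data; // stand-in for an Arrow buffer
        ArrowBackedVector(boolean[] data) { this.data = data; }
        public boolean getBoolean(int rowId) { return data[rowId]; }
    }

    public static void main(String[] args) {
        ReadableColumnVector v = new ArrowBackedVector(new boolean[]{true, false});
        System.out.println(v.getBoolean(0)); // true
    }
}
```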

Member Author

I agree that it'd be better to refactor ColumnVector, but ColumnVector is tied to ColumnarBatch and other classes, so we should do that, and also refactor ColumnarBatch at the same time, in future PRs.

Contributor

+1 on separating the read/write, we should definitely do this before we publish the ColumnVector interfaces.

Member Author

ueshin commented Jul 20, 2017

@BryanCutler Thank you for reviewing!
As for scope, yes, I'd like these APIs to be public. Do you have any concerns about it?

@SparkQA

SparkQA commented Jul 20, 2017

Test build #79787 has finished for PR 18680 at commit 91b94ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

@BryanCutler all classes under the execution package are meant to be private; in the future we will move them to a new package once we are ready to make them public.

```java
import org.apache.spark.unsafe.types.UTF8String;

/**
 * A column backed by Apache Arrow.
```
Contributor


nit: a column vector

```java
public boolean[] getBooleans(int rowId, int count) {
  boolean[] array = new boolean[count];
  for (int i = 0; i < count; ++i) {
    array[i] = accessor.getBoolean(rowId + i);
```
Contributor


we don't need to address this now, but do we have a better implementation with Arrow? cc @BryanCutler

Contributor

@cloud-fan cloud-fan Jul 20, 2017

kind of a batch read API.

Member

I checked Arrow's API docs. I didn't find a batch read API.

```java
childColumns = new ColumnVector[1];
childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
resultArray = new Array(childColumns[0]);
} else if (vector instanceof MapVector) {
```
Contributor


An unrelated question: why is a vector for struct type called MapVector in Arrow? cc @BryanCutler

Member

I'm not sure about the design decision behind it, but it's meant to look up child vectors by name, so it uses a kind of hash map. I agree that another name would have been more intuitive.


```java
@Override
final int getArrayLength(int rowId) {
  return accessor.get(rowId + 1) - accessor.get(rowId);
```
Member


If the given rowId is the last row, is it still valid to call get(rowId + 1)?

Member Author

Yes, the offset vector for ListVector should have (number of arrays + 1) values.
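The invariant above can be sketched with a plain array standing in for the offset vector (names are illustrative only): with N arrays there are N + 1 offsets, so reading offsets[rowId + 1] is valid even for the last row.

```java
// Hedged sketch of the ListVector offset-vector invariant.
public class OffsetSketch {
    // The length of array rowId is the distance between adjacent offsets.
    static int getArrayLength(int[] offsets, int rowId) {
        return offsets[rowId + 1] - offsets[rowId];
    }

    public static void main(String[] args) {
        // Three arrays packed back-to-back with lengths 2, 3, 0.
        int[] offsets = {0, 2, 5, 5}; // 3 arrays, 4 offsets
        System.out.println(getArrayLength(offsets, 0)); // 2
        System.out.println(getArrayLength(offsets, 2)); // 0 (last row, still valid)
    }
}
```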

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Jul 20, 2017

Test build #79793 has finished for PR 18680 at commit 2d1dad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@rxin
Contributor

rxin commented Jul 20, 2017

Have you guys checked the performance of this change? It changes the number of concrete implementations for column vector from 2 to 3 (and potentially 1 to 2 at runtime). This might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches. (It depends on how we use ColumnVector.)

Member Author

ueshin commented Jul 21, 2017

@viirya and the original reporter, thank you for reporting it!
I submitted a follow-up pr #18701.

asfgit pushed a commit that referenced this pull request Jul 21, 2017
… for Arrow vectors.

## What changes were proposed in this pull request?

This is a follow-up of #18680.

In some environment, a compile error happens saying:

```
.../sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java:243:
error: not found: type Array
  public void loadBytes(Array array) {
                        ^
```

This pr fixes it.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18701 from ueshin/issues/SPARK-21472_fup1.
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 29, 2017
…ctor type.

## What changes were proposed in this pull request?

As mentioned at apache#18680 (comment), when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches.

As for read path, one of the major paths is the one generated by `ColumnBatchScan`. Currently it refers `ColumnVector` so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#18989 from ueshin/issues/SPARK-21781.
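The idea in that follow-up can be illustrated with stand-in classes (all names below are hypothetical, not Spark's actual codegen output): when the generated scan loop is typed against the concrete vector class rather than the ColumnVector base class, the call site stays monomorphic and the JIT can devirtualize and inline the read.

```java
// Hedged illustration of referencing the concrete vector type in
// generated code to avoid virtual dispatch.
public class CodegenSketch {
    static final class OnHeapVector { // stand-in for OnHeapColumnVector
        private final int[] data;
        OnHeapVector(int[] data) { this.data = data; }
        int getInt(int rowId) { return data[rowId]; }
    }

    // "Generated" scan loop typed against the concrete class:
    static int sum(OnHeapVector v, int numRows) {
        int s = 0;
        for (int i = 0; i < numRows; i++) {
            s += v.getInt(i); // monomorphic call site, inlinable
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new OnHeapVector(new int[]{1, 2, 3}), 3)); // 6
    }
}
```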